STRATEGY RECOVERY FOR STOCHASTIC MEAN PAYOFF GAMES


MARCELLO MAMINO 


Abstract. We prove that to find optimal positional strategies for stochastic mean 
payoff games when the value of every state of the game is known, in general, is as 
hard as solving such games tout court. This answers a question posed by Daniel 
Andersson and Peter Bro Miltersen. 


In this note, we consider perfect information 0-sum stochastic games, which,
for short, we will just call stochastic games. For us, a stochastic game is a finite
directed graph whose vertices we call states and whose edges we call transitions;
multiple edges and loops are allowed, but no state can be a sink. To each state $s$
is associated an owner $o(s)$, which is one of the two players Max and Min. Each
transition $s \xrightarrow{A,p} t$ has an action $A$ and a probability $p \in \mathbb{Q} \cap [0,1]$, with the condition
that, for each state $s$, the probabilities of the transitions exiting $s$ associated to the
same action must sum to 1. We say that the action $A$ is available at state $s$ if one
of the transitions exiting $s$ is associated to $A$. Furthermore, to each action $A$ is
associated a reward $r(A) \in \mathbb{Q}$.
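The definition above can be made concrete with a small data sketch. The encoding below (a transition as a tuple `(source, action, probability, target)`, and the names `check_game`, `owner`, `reward`) is a hypothetical choice made for illustration only; it merely checks the conditions stated in the definition.

```python
from collections import defaultdict
from fractions import Fraction

def check_game(states, owner, transitions, reward):
    """Return True if the data satisfy the conditions of the definition."""
    total = defaultdict(Fraction)   # (state, action) -> summed probability
    for (s, a, p, t) in transitions:
        if s not in states or t not in states or not (0 <= p <= 1):
            return False
        if a not in reward:         # every action needs a reward r(A)
            return False
        total[(s, a)] += p
    # Transitions leaving s with the same action must have total probability 1.
    if any(q != 1 for q in total.values()):
        return False
    # No state is a sink, and every state is owned by Max or Min.
    has_exit = {s for (s, _a) in total}
    return all(s in has_exit and owner.get(s) in ("Max", "Min") for s in states)

# A toy game (assumed example, not taken from the text).
states = {"u", "v"}
owner = {"u": "Max", "v": "Min"}
reward = {"stay": Fraction(1), "flip": Fraction(0), "back": Fraction(-2)}
transitions = [
    ("u", "stay", Fraction(1), "u"),
    ("u", "flip", Fraction(1, 2), "u"),
    ("u", "flip", Fraction(1, 2), "v"),
    ("v", "back", Fraction(1), "u"),
]
print(check_game(states, owner, transitions, reward))
```

Exact rationals are used because the probabilities and rewards are assumed to lie in $\mathbb{Q}$.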

A play of a stochastic game $G$ begins in some state $s_0$ and produces an unending
sequence of states $\{s_i\}_{i \in \mathbb{N}}$ and actions $\{A_i\}_{i \in \mathbb{N}}$. At move $i$, the owner of the
current state chooses an action $A_i$ among those available at $s_i$, then one of
the transitions exiting $s_i$ with action $A_i$ is selected at random according to their
respective probabilities, and the next state $s_{i+1}$ is the destination of the chosen
transition. A play can be evaluated according to the $\beta$-discounted payoff criterion

\[ v_\beta(A_0, A_1, \ldots) = (1-\beta) \sum_{i=0}^{\infty} r(A_i)\, \beta^i \]

for $\beta \in [0,1)$. Or it can be evaluated according to the mean payoff criterion



\[ v_1(A_0, A_1, \ldots) = \liminf_{n \to \infty} \frac{1}{n+1} \sum_{i=0}^{n} r(A_i). \]
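To see how the two criteria relate, one can evaluate them on a reward sequence that repeats a fixed cycle forever. This is a minimal sketch under assumptions: the helper names are hypothetical, and the mean payoff is taken to be the long-run average reward, which for a periodic sequence is simply the average over one cycle.

```python
from fractions import Fraction

def discounted_periodic(cycle, beta):
    """(1 - beta) * sum_i r_i * beta**i for the sequence repeating `cycle` forever."""
    k = len(cycle)
    one_period = sum(r * beta**j for j, r in enumerate(cycle))
    # Geometric series over whole periods: 1 + beta^k + beta^(2k) + ...
    return (1 - beta) * one_period / (1 - beta**k)

def mean_periodic(cycle):
    """Long-run average reward of the sequence repeating `cycle` forever."""
    return Fraction(sum(cycle), len(cycle))

cycle = [1, 0]
print(discounted_periodic(cycle, Fraction(1, 2)))    # 2/3
print(discounted_periodic(cycle, Fraction(9, 10)))   # 10/19
print(mean_periodic(cycle))                          # 1/2
```

In this example the discounted payoff moves towards the mean payoff as $\beta$ approaches 1.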


The goal of Max is to maximize the evaluation, that of Min is to minimize it. It is
known that for both criteria there are optimal strategies which are positional [Gil57,
LL69], namely such that the action chosen at $s_i$ depends only on the state $s_i$,
and not, for instance, on the preceding states in the play, on $i$, or on a random
choice. Given two positional strategies $\sigma$ and $\tau$ for Max and Min respectively,
and given $\beta \in [0,1]$, we denote by $v_\beta(G, s_0, \sigma, \tau)$ the expected value of $v_\beta$ on all
plays generated by $\sigma$ and $\tau$ starting from $s_0$. We write $v_\beta(G, s_0)$ for $v_\beta(G, s_0, \sigma, \tau)$
with $\sigma$ and $\tau$ optimal. For basic information on stochastic games one may refer to
the book [FV97].

Laboratoire d'Informatique de l'École Polytechnique (LIX), Bâtiment Alan Turing, 1 rue
Honoré d'Estienne d'Orves, Campus de l'École Polytechnique, 91120 Palaiseau, France.

E-mail address: mamino@lix.polytechnique.fr.

Date: 16-vi-2015.

The author has received funding from the European Research Council under the European
Community's Seventh Framework Programme (FP7/2007-2013 Grant Agreement no. 257039).

Given a stochastic game with probabilities and rewards encoded in binary, and
a value of $\beta$ also encoded in binary, it makes sense to study the computational
complexity of the task of solving the game. Strategically solving a game, as defined
in [AM09], means to find a pair of optimal strategies. Quantitatively solving $G$
means to find $v_\beta(G, s)$ for all states $s$. In general, the second task is easier than
the first. The strategy recovery problem is, given the quantitative solution of a game,
to produce a strategic solution. It has been observed in [AM09] that this task
can be performed trivially in linear time for discounted payoff games, and also,
but not trivially, for terminal payoff and simple stochastic games; hence it was
asked whether the same could be done for stochastic mean payoff games (this is,
indeed, the only missing element to complete Andersson and Miltersen's picture).
Our aim is to prove that the strategy recovery problem for stochastic mean payoff
games is as hard as it possibly can be.
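For the discounted case, the remark that recovery is trivial presumably refers to the standard greedy rule. Once the values are known, each state can pick any action attaining the one-step optimality equation $(1-\beta)\, r(A) + \beta \sum_t p\, v_\beta(G,t)$. The sketch below illustrates that rule under this assumption; the data layout, the function name and the toy game with its hand-computed values are all hypothetical.

```python
from fractions import Fraction

def recover_discounted(owner, actions, reward, beta, value):
    """Greedy positional strategies from known discounted values (one pass)."""
    strategy = {}
    for s, acts in actions.items():
        def one_step(a, acts=acts):
            # Expected continuation value after playing action a at state s.
            cont = sum(p * value[t] for (t, p) in acts[a])
            return (1 - beta) * reward[a] + beta * cont
        choose = max if owner[s] == "Max" else min
        strategy[s] = choose(acts, key=one_step)
    return strategy

owner = {"u": "Max", "v": "Min"}
# actions[state][action] = list of (target, probability)
actions = {
    "u": {"stay": [("u", Fraction(1))], "go": [("v", Fraction(1))]},
    "v": {"back": [("u", Fraction(1))], "loop": [("v", Fraction(1))]},
}
reward = {"stay": Fraction(1), "go": Fraction(3),
          "back": Fraction(0), "loop": Fraction(2)}
beta = Fraction(1, 2)
value = {"u": Fraction(2), "v": Fraction(1)}   # hand-computed for this toy game

strategy = recover_discounted(owner, actions, reward, beta, value)
print(strategy)   # {'u': 'go', 'v': 'back'}
```

Each transition is inspected once, which matches the linear-time claim. No analogous local rule is available for mean payoff games, since there the value equations alone do not pin down an optimal action.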

Theorem 1. The strategy recovery problem for stochastic mean payoff games is equivalent,
modulo polynomial time Turing reductions, to the task of strategically solving mean payoff 
games. 

We will combine the reduction from stochastic mean payoff to discounted payoff
games proven in [AM09] with a new reduction from discounted to mean payoff
games of a special form that we call $\beta$-recurrent. Then we will show that $\beta$-recurrent
mean payoff games can be turned into strategically equivalent mean payoff
games having the additional property that all states have value 0. For this latter
class of games, the strategy recovery problem is obviously equivalent to solving
the games strategically.

Definition 2. Let $G$ be a stochastic game and $s_0$ one of the states of $G$. We define
the $\beta$-recurrent game associated to $G$ and $s_0$, denoted $G_{\beta,s_0}$. The game $G_{\beta,s_0}$ has
the same state-space as $G$. Each transition $a \xrightarrow{A,p} b$ in $G$ is replaced by two new
transitions in $G_{\beta,s_0}$, namely $a \xrightarrow{A,\beta p} b$ and $a \xrightarrow{A,(1-\beta)p} s_0$. The first of these new
transitions will be called of the first kind, the second of the second kind. We say that a
game is $\beta$-recurrent if it results from the construction just defined, for some $G$.
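The construction of Definition 2 can be sketched in a few lines. The sketch assumes that the two new transitions carry probabilities $\beta p$ (towards the original target) and $(1-\beta)p$ (towards $s_0$), which is what the proof of Lemma 3 relies on. The tuple encoding and the function name are hypothetical.

```python
from fractions import Fraction

def beta_recurrent(transitions, beta, s0):
    """Split each transition (a, action, p, b) into the two kinds of Definition 2."""
    out = []
    for (a, action, p, b) in transitions:
        out.append((a, action, beta * p, b))          # first kind: proceed as in G
        out.append((a, action, (1 - beta) * p, s0))   # second kind: reset to s0
    return out

# One action "a" at state "u" with two equally likely outcomes (assumed example).
example = [("u", "a", Fraction(1, 2), "u"), ("u", "a", Fraction(1, 2), "v")]
recurrent = beta_recurrent(example, Fraction(1, 3), "u")
for t in recurrent:
    print(t)
```

The probabilities attached to each pair (state, action) still sum to 1, so the result is again a stochastic game in the sense used here.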

Notice that our $\beta$-recurrent games are ergodic in the sense of [BEGM10]. The
complexity of ergodic games has been settled in a recent work [CIJ14a] (see the
full version [CIJ14b]); however, we need for our reduction the extra properties
of $\beta$-recurrent games. Interestingly, the definition of ergodic in [CIJ14a] is more
restrictive than that in [BEGM10], and, in particular, in this stronger sense, a $\beta$-recurrent
game may not be ergodic, nor need an ergodic game be $\beta$-recurrent.

Lemma 3. The task of quantitatively solving stochastic discounted payoff games is
polynomial time Turing reducible to quantitatively solving $\beta$-recurrent stochastic mean
payoff games.
Proof. Consider a stochastic game $G$ and discount factor $\beta$. Let $s_0$ denote a state
of $G$. We will show that

\[ v_\beta(G, s_0) = v_1(G_{\beta,s_0}, s_0). \]

Intuitively, an infinite play of $G_{\beta,s_0}$ can be seen as a sequence of finite sub-plays,
each of which lasts until a transition of the second kind is taken and the game is
reset to the initial state $s_0$. Each sub-play lasts at least one move, but a second
move is played only with probability $\beta$, a third one with probability $\beta^2$, and so
on, thus imitating the discounted payoff situation.
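The intuition above can be made slightly more explicit by a renewal-style computation. This is only a heuristic sketch and not the argument used in the proof. The symbols $N$ (the number of moves in a sub-play) and $A_k$ (the action played at move $k \ge 0$ of a play of $G$ from $s_0$ under fixed positional strategies) are introduced here for illustration.

```latex
\begin{align*}
\Pr[N > k] &= \beta^k,
&
\mathbb{E}[N] &= \sum_{k \ge 0} \beta^k = \frac{1}{1-\beta},
\\
\mathbb{E}[\text{reward of a sub-play}] &= \sum_{k \ge 0} \beta^k\, \mathbb{E}[r(A_k)],
&
\frac{\mathbb{E}[\text{reward of a sub-play}]}{\mathbb{E}[N]}
  &= (1-\beta) \sum_{k \ge 0} \beta^k\, \mathbb{E}[r(A_k)].
\end{align*}
```

The last ratio is the expected $\beta$-discounted payoff in $G$, and it is also the long-run average reward one would expect from a sequence of independent sub-plays.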

In order to prove this identity, it suffices to show that, for any pair of positional
strategies $\sigma$ and $\tau$ for Max and Min respectively, one has

\[ v_1(G_{\beta,s_0}, s_0, \sigma, \tau) = v_\beta(G, s_0, \sigma, \tau). \tag{$*$} \]

In fact, it follows from this equation that $\sigma$ and $\tau$ are a pair of optimal positional
strategies for $G_{\beta,s_0}$ if and only if they are a pair of optimal positional strategies
for $G$ with starting position $s_0$.

It remains to prove equation (*). For each state $s$ of $G$, call $A_{\sigma,\tau}(s)$ the action chosen
by either $\sigma$ or $\tau$ (according to the owner of $s$) at the state $s$. The $\beta$-discounted
values of the states of $G$ are determined by the condition

\[ v_\beta(G, s, \sigma, \tau) = (1-\beta)\, r(A_{\sigma,\tau}(s)) + \sum_{t \in G} \beta\, p_{\sigma,\tau}(s \to t)\, v_\beta(G, t, \sigma, \tau) \]

where $p_{\sigma,\tau}(s \to t)$ denotes the probability that, from state $s$, a transition to state $t$
is chosen when playing strategy $\sigma$ against $\tau$. If we call $s_0, \ldots, s_{n-1}$ the states of $G$ and
$\vec v_\beta = (v_\beta(G, s_i, \sigma, \tau))_{i=0,\ldots,n-1}$ the value vector of $G$, then the condition above can be
rewritten in the form

\[ \vec v_\beta = (1-\beta)\, \vec r + \beta P \vec v_\beta \]

where $\vec r$ is the vector of the rewards $r_i = r(A_{\sigma,\tau}(s_i))$, and $P$ denotes the matrix of
the transition probabilities $P_{ij} = p_{\sigma,\tau}(s_i \to s_j)$. Hence

\[ \vec v_\beta = (1-\beta)(I - \beta P)^{-1}\, \vec r \]

where $I$ denotes the $n \times n$ identity matrix.

Now we turn our attention to the mean payoff of the pair of strategies $\sigma$ and $\tau$
in $G_{\beta,s_0}$. We can compute $v_1(G_{\beta,s_0}, s_0, \sigma, \tau)$ averaging the rewards over the stable
distribution of the Markov chain induced by these strategies on the states of $G$.
This stable distribution $\mu$ must be unique, because, by virtue of $G_{\beta,s_0}$ being
$\beta$-recurrent, the Markov chain is connected. Moreover $\mu$ is determined by the
condition

\[ \mu(s) = (1-\beta)\, \delta_{s_0}(s) + \sum_{t \in G} \beta\, p_{\sigma,\tau}(t \to s)\, \mu(t) \]

where $\delta_{s_0}(s)$ is 1 if $s = s_0$ and 0 otherwise. Rewriting as above, we get

\[ \vec\mu = (1-\beta)\, e_0 + \beta P^T \vec\mu \]

where $e_0$ is the first element of the canonical basis and $\mu_i = \mu(s_i)$. Hence

\[ \vec\mu = (1-\beta)(I - \beta P^T)^{-1} e_0. \]
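The two closed forms obtained so far can be compared numerically on a small example. The sketch below uses exact rational arithmetic and a hand-written Gauss-Jordan solver. The chain `P`, the rewards `r` and the value of `beta` are arbitrary assumptions standing for the chain induced by some fixed pair of positional strategies, with state 0 playing the role of $s_0$.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, invertible)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = next(i for i in range(c, n) if M[i][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [x / d for x in M[c]]
        for i in range(n):
            if i != c and M[i][c] != 0:
                f = M[i][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[c])]
    return [row[n] for row in M]

beta = F(1, 2)
P = [[F(1, 2), F(1, 2), F(0)],
     [F(0), F(1, 3), F(2, 3)],
     [F(1), F(0), F(0)]]
r = [F(3), F(-1), F(2)]
n = len(P)

# Discounted values: v = (1 - beta) (I - beta P)^{-1} r
A = [[(1 if i == j else 0) - beta * P[i][j] for j in range(n)] for i in range(n)]
v = solve(A, [(1 - beta) * x for x in r])

# Stable distribution of the beta-recurrent chain: mu = (1 - beta) (I - beta P^T)^{-1} e_0
At = [[A[j][i] for j in range(n)] for i in range(n)]
mu = solve(At, [(1 - beta) * (1 if i == 0 else 0) for i in range(n)])

mean_payoff = sum(m * x for m, x in zip(mu, r))
print(sum(mu) == 1, mean_payoff == v[0])
```

Both comparisons hold exactly. The second one is the identity (*) for this particular chain.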
Now, computing the average

\[
\begin{aligned}
v_1(G_{\beta,s_0}, s_0, \sigma, \tau)
  &= \sum_{s \in G} \mu(s)\, r(A_{\sigma,\tau}(s)) \\
  &= \vec\mu^T \vec r \\
  &= e_0^T (1-\beta)(I - \beta P)^{-1}\, \vec r \\
  &= v_\beta(G, s_0, \sigma, \tau). \qquad \Box
\end{aligned}
\]

Lemma 4. The task of strategically solving $\beta$-recurrent stochastic mean payoff games is
polynomial time many-one reducible to the strategy recovery problem for stochastic mean
payoff games.

Proof. Let $G_{\beta,s_0}$ be a $\beta$-recurrent stochastic game. As we noticed, all the states
of $G_{\beta,s_0}$ have the same value. Nevertheless, we have no obvious way to determine
this value in order to complete the reduction. Instead, we choose to construct a
new mean payoff game $G'$ in such a way that all the states of $G'$ get mean payoff
value equal to 0, and nonetheless a pair of optimal strategies for $G_{\beta,s_0}$ can be
recovered from a pair of optimal strategies for $G'$. This is clearly sufficient to
establish the lemma.

The game $G'$ is constructed as two chained copies $G^1$ and $G^2$ of $G_{\beta,s_0}$, redirecting
all the transitions of the second kind in each instance (which go to the state
corresponding to $s_0$ in that instance) to the $s_0$-state in the other. The states
of $G^1$ have the same owner as in $G_{\beta,s_0}$, and the transitions originating in $G^1$ are
associated to the same actions with the same rewards as in $G_{\beta,s_0}$. In $G^2$, however,
the owners are switched and the signs of the rewards exchanged (formally we
replace each action $A$ with a new one $A'$ having $r(A') = -r(A)$). If both players
play optimally, we may expect each to win in $G^1$ precisely as much as he loses
in $G^2$, hence, arguably, the value of $G'$ should be 0. On the other hand, in order
to play optimally in $G'$, one should play optimally in both the components, so
we should be able to extract optimal positional strategies for $G_{\beta,s_0}$ from optimal
positional strategies for $G'$ by mere restriction to the component $G^1$. We will now
proceed to prove our statement.

Let us denote by $s^1$ and $s^2$ respectively the states of $G^1$ and $G^2$ corresponding to a
given state $s$ of $G_{\beta,s_0}$. First observe that a play of $G'$, almost surely, will eventually
reach state $s_0^1$; from this it follows that all the states of $G'$ must have the same value
($G'$ is ergodic). A positional strategy $\sigma$ for Max in $G'$ can be seen as a pair of
positional strategies $(\sigma^1, \sigma^2)$, where $\sigma^1$ is the strategy for Max in $G_{\beta,s_0}$ that we get
restricting $\sigma$ to $G^1$, and $\sigma^2$ is the strategy for Min in $G_{\beta,s_0}$ that we get from the
restriction of $\sigma$ to $G^2$ (remember that in $G^2$ the players are switched). Similarly a
strategy $\tau$ for Min in $G'$ can be seen as a pair of strategies $(\tau^1, \tau^2)$ in $G_{\beta,s_0}$, the
first one for Min and the second for Max. We will prove that for any $\sigma$ and $\tau$

\[ v_1(G', \cdot, \sigma, \tau) = \tfrac{1}{2}\, v_1(G_{\beta,s_0}, \cdot, \sigma^1, \tau^1) - \tfrac{1}{2}\, v_1(G_{\beta,s_0}, \cdot, \tau^2, \sigma^2). \tag{$**$} \]

From this equation, it follows at once that $\sigma$ is an optimal strategy for $G'$ if and
only if $(\sigma^1, \sigma^2)$ is a pair of optimal strategies for $G_{\beta,s_0}$, and, in particular, the
value of $G'$ is 0.

We turn now to the proof of equation (**). Consider the unique stable distribution
$\mu$ of the Markov process induced by $\sigma$ and $\tau$. Observe that, independently
from $\sigma$ and $\tau$, at any given state, our Markov chain has probability $\beta$ of transitioning
to a state belonging to the same component, and probability $1-\beta$ of switching
component. It follows that the sequence of the components must obey the law of
a two-state Markov chain with transition matrix

\[ \begin{pmatrix} \beta & 1-\beta \\ 1-\beta & \beta \end{pmatrix}. \]

Hence $\mu(G^1) = \mu(G^2) = 1/2$. It suffices to prove that the probability distributions
$\mu^1$ and $\mu^2$ defined on the states of $G_{\beta,s_0}$ by $\mu^1(s) = 2\mu(s^1)$ and $\mu^2(s) = 2\mu(s^2)$
are the stable distributions induced on $G_{\beta,s_0}$ by the pairs of strategies $(\sigma^1, \tau^1)$
and $(\tau^2, \sigma^2)$ respectively.
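The claim that each component carries stationary mass $1/2$ can be checked directly. The uniform row vector is left-invariant under the two-state transition matrix, and for $\beta < 1$ the two-state chain is irreducible, so this is its only stationary distribution.

```latex
\[
\begin{pmatrix} \tfrac12 & \tfrac12 \end{pmatrix}
\begin{pmatrix} \beta & 1-\beta \\ 1-\beta & \beta \end{pmatrix}
=
\begin{pmatrix} \tfrac{\beta}{2} + \tfrac{1-\beta}{2} & \tfrac{1-\beta}{2} + \tfrac{\beta}{2} \end{pmatrix}
=
\begin{pmatrix} \tfrac12 & \tfrac12 \end{pmatrix}.
\]
```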

By symmetry, we can concentrate on $\mu^1$. Let $p_{\sigma,\tau}(t, s)$ denote the probability of
the transition $t \to s$ in the Markov process induced by the strategies $\sigma$ and $\tau$. Since
all states of $G^1$ except $s_0^1$ are only reachable from within $G^1$ itself, the consistency
equation for $\mu$ being a stable distribution on $G'$

\[ \mu(s) = \sum_{t \in G'} p_{\sigma,\tau}(t, s)\, \mu(t) \]

implies the same condition for $\mu^1$ at all states except $s_0$. At $s_0$ one concludes by
direct computation observing that the component of the sum on the right hand
side due to transitions of the second kind must be

\[ (1-\beta)\, \mu(G^2) = \frac{1-\beta}{2} = (1-\beta)\, \mu(G^1). \qquad \Box \]
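Equation (**) can also be checked numerically at the level of the induced Markov chains. In the sketch below, `P1` and `r1` stand for the chain and rewards induced on $G$ by $(\sigma^1, \tau^1)$, and `P2` and `r2` for those induced by $(\tau^2, \sigma^2)$. These matrices, the value of `beta` and all function names are assumptions chosen for illustration, with state 0 playing the role of $s_0$.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, invertible)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = next(i for i in range(c, n) if M[i][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [x / d for x in M[c]]
        for i in range(n):
            if i != c and M[i][c] != 0:
                f = M[i][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[c])]
    return [row[n] for row in M]

def stationary(Q):
    """Stationary distribution of a chain with a single recurrent class."""
    n = len(Q)
    A = [[Q[j][i] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    A[-1] = [F(1)] * n                      # replace one equation by sum(mu) = 1
    return solve(A, [F(0)] * (n - 1) + [F(1)])

def recurrent(P, beta):
    """Chain on G_{beta,s_0}: follow P with probability beta, else reset to state 0."""
    n = len(P)
    return [[beta * P[i][j] + ((1 - beta) if j == 0 else 0) for j in range(n)]
            for i in range(n)]

def chained(P1, P2, beta):
    """Chain on G': a reset from one copy lands on s_0 of the other copy."""
    n = len(P1)
    Q = [[F(0)] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            Q[i][j] = beta * P1[i][j]
            Q[n + i][n + j] = beta * P2[i][j]
        Q[i][n] = 1 - beta        # from G^1 to s_0 in G^2
        Q[n + i][0] = 1 - beta    # from G^2 to s_0 in G^1
    return Q

def mean(Q, r):
    return sum(m * x for m, x in zip(stationary(Q), r))

beta = F(2, 3)
P1 = [[F(0), F(1)], [F(1, 2), F(1, 2)]]
P2 = [[F(1, 4), F(3, 4)], [F(1), F(0)]]
r1, r2 = [F(5), F(-1)], [F(2), F(7)]

lhs = mean(chained(P1, P2, beta), r1 + [-x for x in r2])   # rewards negated in G^2
rhs = (mean(recurrent(P1, beta), r1) - mean(recurrent(P2, beta), r2)) / 2
print(lhs, rhs, lhs == rhs)
```

When both copies use the same chain and rewards, the two halves cancel and the mean payoff of the chained chain is 0, in line with the value of $G'$ being 0.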

Proof of Theorem 1. By [AM09, Theorem 1], solving stochastic mean payoff games
strategically is reducible to solving stochastic discounted payoff games quantitatively,
which reduces, by Lemma 3, to solving $\beta$-recurrent stochastic mean payoff
games quantitatively. In turn, solving such $\beta$-recurrent games quantitatively is
reducible to solving the same strategically, just because they are, in particular,
stochastic mean payoff games. By Lemma 4, this final task is reducible to the
strategy recovery problem for stochastic mean payoff games. □

Finally, we would like to remark that our construction relies on the interpretation 
of strategic solution as requiring optimal positional strategies. Were a more general 
class of strategies available, then the problem of finding an optimal one would 
become easier. In particular, the games produced by Lemma 4 happen to be 
symmetric under switching the players and the signs of the rewards. Under this
circumstance, it would not be surprising if one could play optimally by some form 
of strategy stealing technique. 